[v1][WIP] Metrics & Stats prototype #10651
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:

🚀
Force-pushed from 626d716 to 0a666e7
    # TODO: can we eliminate these?
    async def _run_stats_handler(self):
        while True:
This is where stats were polled from the backend.
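For illustration, a minimal sketch of what such a frontend polling loop could look like; `log_interval_s`, `engine_core.get_stats`, and `stat_loggers` are assumed names for this sketch (only `stat_logger.log(stats)` appears elsewhere in this PR):

```python
import asyncio

async def _run_stats_handler(self):
    # Periodically pull aggregated stats from the EngineCore process
    # and hand them to the registered stat loggers.
    while True:
        await asyncio.sleep(self.log_interval_s)  # e.g. 1 second
        stats = await self.engine_core.get_stats()  # assumed RPC to the backend
        for stat_logger in self.stat_loggers:
            stat_logger.log(stats)
```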
    request_type = type_frame.buffer
    request_data = data_frame.buffer

    if request_type == EngineCoreRequestType.STATS.value:
This is where the backend process responds. Note this is independent from the actual execution thread.
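Roughly, the backend-side dispatch could look like the sketch below; the `snapshot` method name and the response socket/framing are assumptions for illustration, not this PR's actual code:

```python
import pickle

def _handle_request(self, request_type: bytes, request_data: bytes) -> None:
    # Runs on the EngineCore IO thread, not the execution thread.
    if request_type == EngineCoreRequestType.STATS.value:
        # Snapshot the agent's accumulated stats and ship them back
        # without blocking the model execution loop.
        stats = self.stats_agent.snapshot()  # assumed method name
        self.stats_socket.send(pickle.dumps(stats))  # assumed response socket
```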
This pull request has merge conflicts that must be resolved before it can be merged.
        free_encoder_input_ids=self.encoder_cache_manager.get_freed_ids(),
    )

    if self.log_stats:
One thing that got us in trouble in V0 is too many loops over these types of data structures. I wonder if the creation of the request stats could be done in the list comprehensions used above, rather than in a separate pass (see the sketch below).
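For example, the outputs and the request stats could be built in a single pass instead of two comprehensions; `make_output`, `make_request_stats`, and `scheduled_reqs` are placeholder names here:

```python
outputs = []
request_stats = []
for req in scheduled_reqs:
    outputs.append(make_output(req))
    if self.log_stats:
        # Built in the same pass rather than a second loop over the requests.
        request_stats.append(make_request_stats(req))
```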
This pull request has merge conflicts that must be resolved before it can be merged.
    detokenizer_req, engine_core_req = self.processor.process_inputs(
        request_id, prompt, params, arrival_time, lora_request,
        trace_headers, prompt_adapter_request, priority)
    self.stats_manager.record_engine_input(engine_core_req)
do we need this for LLMEngine as well?
Oh yes, the prototype was only on AsyncLLM, but it should be the same for the sync LLMEngine (a sketch follows below).
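A minimal sketch of the same hook on the sync engine, assuming its `add_request` mirrors the AsyncLLM path shown above (the downstream calls are assumptions):

```python
def add_request(self, request_id, prompt, params, arrival_time=None,
                lora_request=None, trace_headers=None,
                prompt_adapter_request=None, priority=0):
    # Same input-processing path as AsyncLLM above.
    detokenizer_req, engine_core_req = self.processor.process_inputs(
        request_id, prompt, params, arrival_time, lora_request,
        trace_headers, prompt_adapter_request, priority)
    # Mirror the AsyncLLM hook so sync-engine requests are counted too.
    self.stats_manager.record_engine_input(engine_core_req)
    self.detokenizer.add_request(detokenizer_req)   # assumed downstream calls
    self.engine_core.add_request(engine_core_req)
```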
        stat_logger.log(stats)


    class ThreadSafeEngineStatsManager(EngineStatsManager):
There aren't any threads in the AsyncLLM class right now (it's just using asyncio), so the ThreadSafe variant is not needed.
Oh right, good catch. I wonder if in the future we should isolate the stats logger in another thread, since the stats logger could be pretty expensive (e.g. the Prometheus logging is quite slow when it comes to histograms). Something like the sketch below.
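A sketch of that isolation: a daemon thread drains a queue, so the event loop only pays for an enqueue. The class and attribute names here are made up for illustration:

```python
import queue
import threading

class BackgroundStatLogger:
    """Wraps expensive loggers (e.g. Prometheus histograms) in a thread."""

    def __init__(self, stat_loggers):
        self._stat_loggers = stat_loggers
        self._queue: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def log(self, stats):
        # Cheap for the caller: the slow logging happens off the event loop.
        self._queue.put(stats)

    def _drain(self):
        while True:
            stats = self._queue.get()
            for stat_logger in self._stat_loggers:
                stat_logger.log(stats)
```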
    # on the background thread.
    # 2. the main event loop when a request is added (and input
    #    processed).
    self.stats_manager = ThreadSafeEngineStatsManager(
There aren't any threads in this class (just asyncio), so we don't need a thread-safe variant.
I think this design is on the right track. As discussed in Slack (https://vllm-dev.slack.com/archives/C07QTE01QFQ/p1732588846571919), I think we should remove the EngineStatsAgent from the EngineCore and instead send the stats back in EngineCoreOutputs each step.
This will simplify the state management in the EngineCore and reduce the number of RPCs we have to worry about, with the same amount of work being executed on the core busy loop as in the current design. A sketch of that direction follows.
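A rough sketch of the suggested direction, piggybacking stats on the per-step outputs; every name here beyond `EngineCoreOutputs` is an assumption for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SchedulerStats:
    # Assumed per-step counters; whatever the scheduler can cheaply report.
    num_running_reqs: int = 0
    num_waiting_reqs: int = 0
    gpu_cache_usage: float = 0.0

@dataclass
class EngineCoreOutput:
    request_id: str = ""
    new_token_ids: List[int] = field(default_factory=list)

@dataclass
class EngineCoreOutputs:
    outputs: List[EngineCoreOutput] = field(default_factory=list)
    # Stats ride along with every step, so no separate STATS RPC
    # and no EngineStatsAgent state inside the EngineCore.
    scheduler_stats: Optional[SchedulerStats] = None
```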
Closing as stale, feel free to re-open if you feel this PR is still relevant.
Requirements and Goals
Proposal and Design
- `RequestStats` and `RequestStatsUpdate` capture request updates for metrics. They cover the entire request lifecycle (arrival -> input processing -> engine core prefill/decode -> [preempt] -> detokenized); see the sketch after this list.
- `EngineStatsAgent` lives with the `EngineCore` to collect stats updates from various core components and scheduler/model outputs.
- `EngineStatsManager` lives on the `AsyncLLM` engine and aggregates stats from the agent.

On Backend (EngineCore Process):

- `EngineStatsAgent` collects metrics in a thread-safe manner.

On Frontend (Main Process):

- `EngineStatsManager` aggregates stats from the backend as well as collecting stats on the engine frontend.
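A sketch of the lifecycle updates described above; only the two class names come from this PR, while the enum values and fields are illustrative:

```python
import enum
import time
from dataclasses import dataclass, field
from typing import List

class RequestUpdateType(enum.Enum):
    # Assumed lifecycle stages, following the arrow chain above.
    ARRIVED = enum.auto()
    INPUT_PROCESSED = enum.auto()
    PREFILLING = enum.auto()
    DECODING = enum.auto()
    PREEMPTED = enum.auto()
    DETOKENIZED = enum.auto()

@dataclass
class RequestStatsUpdate:
    request_id: str
    update_type: RequestUpdateType
    ts: float = field(default_factory=time.monotonic)

@dataclass
class RequestStats:
    request_id: str
    updates: List[RequestStatsUpdate] = field(default_factory=list)

    def record(self, update_type: RequestUpdateType) -> None:
        # Append a timestamped update as the request moves through stages.
        self.updates.append(RequestStatsUpdate(self.request_id, update_type))
```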
Benchmark Results

TL;DR: There's no observable performance regression when stats collection happens every 1 second for the workloads below, compared to stats collection turned off.
Throughput
Latency